This project uses an open dataset of NYC Airbnb listings from October 2020 and aims to predict the daily rental price. Pricing a property can be a headache for new hosts who are unfamiliar with the rental market. The predictive model built in this project provides hosts with a fair market price to refer to once information such as property type, number of bedrooms, and neighborhood is available. Since the dataset covers only October, the predicted prices may only be reliable for October or early November.
# Import libraries
import pandas as pd
import numpy as np
from scipy import stats
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
# Import dataset
df = pd.read_csv("listings.csv")
The dataset includes the following 74 columns. It has 44666 observations in total.
df.columns
df.shape
After examining each column, we found that the following columns are uninformative for building predictive models. Information about hosts, such as id and name, is irrelevant to the pricing decision, so we exclude all host-related columns. We also have columns that contain only NAs (license, bathrooms, and calendar_updated), so we remove them as well. neighbourhood_cleansed is less specific than latitude and longitude, so we remove it too.
# Drop useless columns
df = df.drop(['listing_url', 'scrape_id', 'picture_url', 'host_id', 'host_url',
              'host_name', 'host_location', 'host_about', 'host_thumbnail_url',
              'calendar_updated', 'calendar_last_scraped', 'bathrooms', 'license',
              'neighbourhood', 'neighbourhood_cleansed', 'host_neighbourhood',
              'last_scraped', 'host_picture_url', 'host_since', 'host_response_rate',
              'host_response_time', 'host_acceptance_rate',
              'calculated_host_listings_count',
              'calculated_host_listings_count_entire_homes',
              'calculated_host_listings_count_private_rooms',
              'calculated_host_listings_count_shared_rooms',
              'host_identity_verified', 'host_has_profile_pic', 'host_is_superhost',
              'host_listings_count', 'host_total_listings_count',
              'host_verifications'], axis=1)
df.shape
Review-related columns need to be dropped because future information shouldn't be used to predict the present price.
df = df.drop(['number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
"review_scores_rating","review_scores_accuracy","review_scores_cleanliness",
"review_scores_checkin","review_scores_communication",
"review_scores_location","review_scores_value",
"first_review","last_review","reviews_per_month"], axis = 1)
Let's take a look at how many missing values we have.
df.shape
df.isnull().sum()
The majority of NAs are in bedrooms. We will fill them in later.
First, we examine the numeric features. We will visualize each variable and then handle the null values.
df["price"].head()
# Clean "price" column
df["price"] = df["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype('float')  # "$" is a regex metacharacter, so disable regex
df["price"].describe()
df = df[df["price"]<1500]
We drop extreme prices greater than $1500 since they are not mainstream rental prices and would bias our models.
df[df["price"] == 0 ][["description","name","property_type"]]
It seems listings with price = 0 are fake listings, so we remove them from our data.
df = df[df["price"] != 0]
Let's visualize our price with distribution and Q-Q plots.
fig, ax = plt.subplots()
sns.distplot(df["price"], bins=np.linspace(0, 1500, 50), fit=stats.norm)
plt.xlim(0, 1500)
plt.show()
res = stats.probplot(df['price'], plot=plt)
Price is right skewed, so we log transform it to make it more normally distributed. One of the assumptions of linear regression is that the errors are normally distributed for any fixed values of the predictors, so if we want a linear model to fit our data better, it helps to log transform price. We add a new column for the log transformed price.
df["log_p"] = np.log1p(df["price"])
sns.distplot(df["log_p"], fit = stats.norm)
plt.show()
df[["accommodates","bedrooms", "beds"]].head(5)
sns.set(font_scale=3)
fig, ax = plt.subplots()
fig.set_size_inches(20, 10)
sns.boxplot(x=df["bedrooms"], y=df["log_p"])
plt.show()
It appears that the above three variables have a roughly linear relationship with price. We also drop listings with 0 beds.
df = df[df["beds"] != 0]
df[["accommodates","bedrooms", "beds"]].corr()
"accommodates", "bedrooms" and "beds" all show some correlation with "log_p", so we keep them. Next, let's deal with the missing values.
df[["accommodates","bedrooms", "beds"]].isnull().sum()
There are some NAs in "bedrooms" and "beds", especially in "bedrooms". The number of missing values in "beds" is small, so we drop those observations. For "bedrooms", we use "accommodates" and "beds" to estimate the missing values since the correlation coefficients among them are fairly large, around 0.7. Let's try linear regression, random forest, and decision tree models and select the best one for imputation.
df = df[~df["beds"].isnull()]
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df_notnans = df[df[["bedrooms","beds","accommodates"]].notnull().all(axis=1)][["bedrooms","beds","accommodates"]]
X_train, X_test, y_train, y_test = train_test_split(df_notnans[["accommodates","beds"]], df_notnans["bedrooms"],
train_size=0.75,
random_state=4)
reg = LinearRegression().fit(X_train, y_train)
# Fit on the train data
score = reg.score(X_train, y_train)
print("The prediction score on the train data is {:.2f}%".format(score*100))
from sklearn.ensemble import RandomForestRegressor
regr = RandomForestRegressor(max_depth=2, random_state=0)
regr.fit(X_train, y_train)
score = regr.score(X_train, y_train)
print("The prediction score on the train data is {:.2f}%".format(score*100))
from sklearn.tree import DecisionTreeRegressor
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_1.fit(X_train, y_train)
score = regr_1.score(X_train, y_train)
print("The prediction score on the train data is {:.2f}%".format(score*100))
The linear regression model is better, so we use it to impute the missing values. We create a new column "bedrooms_pred" in which the missing values are replaced with predicted values, and we also fill the missing values in the original column with the mean. We will compare the performance of the model using "bedrooms_pred" with the one using "bedrooms".
score_1 = reg.score(X_test, y_test)
print("The prediction score on the test data is {:.2f}%".format(score_1*100))
# Fill NA in "bedrooms"
nanbedroom = df[~df[["bedrooms","beds","accommodates"]].notnull().all(axis=1)][["bedrooms","beds","accommodates"]]
nanbedroom["bedrooms"] = reg.predict(nanbedroom[["accommodates","beds"]]).round()
df["bedrooms_pred"] = df["bedrooms"].fillna(nanbedroom["bedrooms"])
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].mean())
sns.set(font_scale=1.5)
sns.distplot(df["minimum_minimum_nights"], hist=False, hist_kws=dict(ec='w'), label='minimum_minimum_nights')
sns.distplot(df["maximum_minimum_nights"], hist=False, hist_kws=dict(ec='w'), label='maximum_minimum_nights');
sns.distplot(df["maximum_maximum_nights"], hist=False, hist_kws=dict(ec='w'), label='maximum_maximum_nights');
sns.distplot(df["minimum_maximum_nights"], hist=False, hist_kws=dict(ec='w'), label='minimum_maximum_nights')
sns.distplot(df["minimum_nights_avg_ntm"], hist=False, hist_kws=dict(ec='w'), label='minimum_nights_avg_ntm')
sns.distplot(df["maximum_nights_avg_ntm"], hist=False, hist_kws=dict(ec='w'), label='maximum_nights_avg_ntm')
df[["minimum_minimum_nights", "maximum_minimum_nights",
"minimum_maximum_nights","maximum_maximum_nights",
"minimum_nights_avg_ntm", "maximum_nights_avg_ntm","price","log_p"]].corr()
The above 6 variables display only two distinct distribution patterns, so we keep just two of them, one from each pattern.
df = df.drop(["maximum_minimum_nights","minimum_maximum_nights",
"minimum_nights_avg_ntm", "maximum_nights_avg_ntm"],axis = 1)
plt.figure(figsize=(20,10))
sns.distplot(df["availability_30"], hist=False, hist_kws=dict(ec='w'), label='availability_30')
sns.distplot(df["availability_60"], hist=False, hist_kws=dict(ec='w'), label="availability_60");
sns.distplot(df["availability_90"], hist=False, hist_kws=dict(ec='w'), label="availability_90");
sns.distplot(df["availability_365"], hist=False, hist_kws=dict(ec='w'), label="availability_365");
plt.xlabel("day")
plt.legend();
df[['availability_30','availability_60', 'availability_90',
'availability_365',"price","log_p"]].corr()
df[['availability_30','availability_60', 'availability_90',
'availability_365']].isnull().sum()
sub=df[df.price < 500]
heat=sub.plot(kind='scatter', x='longitude', y='latitude', c='price',
cmap=plt.get_cmap('jet'), colorbar=True, alpha=0.4, figsize=(10,8))
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)
heat.legend()
df[["has_availability","instant_bookable"]].head()
df[["has_availability","instant_bookable"]].isnull().sum()
ax = sns.boxplot(x=df["has_availability"], y=np.log1p(df["price"]))
plt.show()
df[df["has_availability"] == "f"]["price"]
All observations are "t" except one being "f", so we drop this column.
df = df.drop("has_availability", axis = 1)
ax = sns.boxplot(x=df["instant_bookable"], y=np.log1p(df["price"]))
plt.show()
df["bathrooms_text"].unique()
This column mingles numeric and categorical values, so we need to separate them.
df["bathrooms"] = df["bathrooms_text"].str.replace("baths","").str.replace("shared","").str.replace("bath","").str.replace("private","")
df["bathrooms"] = df["bathrooms"].str.replace("Shared half-","0.5").str.replace("Private half-","0.5").str.replace("Half-","0.5").astype(float)
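The chained `str.replace` calls above are brittle if Airbnb ever changes the wording of `bathrooms_text`. As a sketch of a more robust alternative (the toy strings below are illustrative, not this notebook's approach), a regex can pull the number out directly:

```python
import pandas as pd

# toy strings mimicking "bathrooms_text"
s = pd.Series(["1 bath", "2 shared baths", "Shared half-bath",
               "1.5 baths", "Private half-bath"])
nums = s.str.extract(r"(\d+\.?\d*)")[0].astype(float)  # leading number, if any
nums = nums.fillna(0.5)  # "half-bath" entries carry no digits
print(nums.tolist())  # → [1.0, 2.0, 0.5, 1.5, 0.5]
```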
sns.barplot(x = "room_type",y = "bathrooms",data = df)
plt.show()
We fill the missing values in "bathrooms" and create a "bath_type" column indicating whether the bathroom is shared or private.
df["bathrooms"] = df["bathrooms"].fillna(1)
df["bathrooms_text"] = df["bathrooms_text"].fillna("1")
df[["bathrooms","price"]].corr()
df["bath_type"] = np.where(df["bathrooms_text"].str.contains("shared"), "shared", "private")
df = df.drop(["bathrooms_text"], axis = 1)
sns.set(font_scale=3)
plt.figure(figsize=(20,10))
ax = sns.boxplot(x=df["bath_type"], y=df["log_p"])
plt.show()
plt.figure(figsize=(20,10))
ax = sns.boxplot(x=df["room_type"], y=df["log_p"])
plt.show()
df["property_type"].unique()
len(df["property_type"].unique())
Since "room_type" already specifies whether the listing is private, entire, or shared, we no longer need this information in "property_type". Therefore, we clean it out.
def clean_property(word):
    df["property_type"] = df["property_type"].str.replace(word, "")
clean_property("Entire")
clean_property("Shared")
clean_property("in")
clean_property("Private")
df["property_type"].unique()
Now we are left with the real property types, and we need to extract the key property words such as house, apartment, hotel, etc. Uncommon property types with fewer than 10 listings are categorized as "other".
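The "fewer than 10" rule can also be applied generically from value counts rather than by listing each type by hand; a sketch on toy data (the counts here are illustrative):

```python
import pandas as pd

# toy property types
s = pd.Series(["apartment"] * 50 + ["house"] * 20 + ["castle"] * 2 + ["cave"])
counts = s.value_counts()
rare = counts[counts < 10].index       # categories seen fewer than 10 times
s = s.where(~s.isin(rare), "other")    # lump them into "other"
print(s.value_counts().to_dict())  # → {'apartment': 50, 'house': 20, 'other': 3}
```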
def further_clean_property(word):
    df.loc[df["property_type"].str.contains(word, case=False), "property_type"] = word
further_clean_property("apartment")
df.loc[df["property_type"].str.contains("apt",case=False), "property_type"] = "apartment"
further_clean_property("guesthouse")
further_clean_property("hotel")
further_clean_property("hostel")
further_clean_property("loft")
further_clean_property("suit")
further_clean_property("villa")
further_clean_property("condominium")
further_clean_property("resort")
further_clean_property("castle")
further_clean_property("cave")
further_clean_property("cottage")
further_clean_property("barn")
further_clean_property("bungalow")
further_clean_property("townhouse")
further_clean_property("guesthouse")
further_clean_property("bed and breakfast")
further_clean_property("floor")
further_clean_property("boat")
df.loc[df["property_type"].str.contains("room house",case=False), "property_type"] = "house"
df.loc[df["property_type"].str.contains(" house",case=False), "property_type"] = "house"
df.loc[df["property_type"].str.contains("room",case=False), "property_type"] = "other"
df.loc[df["property_type"].str.contains("Camper/RV",case=False), "property_type"] = "other"
df.loc[df["property_type"].str.contains("cottage",case=False), "property_type"] = "other"
df.loc[df["property_type"].str.contains("castle",case=False), "property_type"] = "other"
df.loc[df["property_type"].str.contains("barn",case=False), "property_type"] = "other"
df.loc[df["property_type"].str.contains("cave",case=False), "property_type"] = "other"
df.loc[df["property_type"].str.contains("Casa particular",case=False), "property_type"] = "other"
df.loc[df["property_type"].str.contains("timeshare",case=False), "property_type"] = "other"
df.loc[df["property_type"].str.contains("Lighthouse",case=False), "property_type"] = "house"
df["property_type"].value_counts()
plt.figure(figsize=(20,10))
ax = sns.boxplot(x=df["property_type"], y=df["log_p"])
plt.xticks(rotation=70)
plt.rc('xtick', labelsize=15)
plt.rc('ytick', labelsize=15)
plt.show()
plt.figure(figsize=(20,10))
ax = sns.boxplot(x=df["neighbourhood_group_cleansed"], y=df["log_p"])
plt.show()
df.isnull().sum()
text_data = pd.DataFrame(df[["name","description","neighborhood_overview","amenities"]])
df = df.drop(["name","description","neighborhood_overview","amenities"], axis = 1)
df.to_csv('listings_clean.csv', index=False)
text_data.to_csv("text.csv", index = False)
def convert_binary_column_dummy(column, df):
    df[column] = df[column].map({"t": 1, "f": 0})
convert_binary_column_dummy("instant_bookable",df)
df["bath_type"] = df["bath_type"].map({"private": 1, "shared": 0})
df = pd.get_dummies(df)
df.dtypes
df.isnull().sum()
We use LASSO to deal with multicollinearity and to select powerful features. It is crucial to normalize the features before fitting a LASSO regression. First, let's see whether there is a difference between using "bedrooms" and "bedrooms_pred".
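One caveat: the scaling statistics should come from the training data only, otherwise information leaks from validation folds into the fit. A `Pipeline` enforces this automatically during cross validation; a minimal sketch on synthetic data (not this dataset):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([1, 10, 100, 1, 1])  # wildly different scales
y = X @ np.array([1.0, 0.1, 0.01, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

# The scaler is re-fit on the training folds inside each CV split,
# so no scaling statistics leak from the held-out fold.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
model.fit(X, y)
print(model.score(X, y))
```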
from sklearn.linear_model import LassoCV
def mse(y, yhat):
    return np.mean((y - yhat)**2)

def rmse(y, yhat):
    return mse(y, yhat)**0.5
models = {}
rmse_scores = []
mae_scores = []
r_squared = []
# Model with simple imputation
from sklearn import preprocessing
from sklearn.metrics import mean_absolute_error
y = df.price
keep_cols1 = [x for x in df.columns.values if x not in ["bedrooms_pred","id","price","log_p"]]
X1 = df[keep_cols1]
X1_train, X1_test, y_train, y_test = train_test_split(X1, y,train_size=0.8, random_state=4)
clf1 = LassoCV(cv=5,max_iter = 10000)
X1_scaled = preprocessing.scale(X1_train)
clf1.fit(X1_scaled, y_train)
y_pred = clf1.predict(X1_scaled)
rmse_scores.append(rmse(y_pred, y_train))
mae_scores.append(mean_absolute_error(y_train,y_pred))
r_squared.append(clf1.score(X1_scaled, y_train))
print(rmse(y_pred, y_train))
print(mean_absolute_error(y_train,y_pred))
models["Lasso with simple imputation"] = clf1
# Model with predicted bedrooms
keep_cols2 = [x for x in df.columns.values if x not in ["bedrooms","id","price","log_p"]]
X2 = df[keep_cols2]
X2_train, X2_test, y_train, y_test = train_test_split(X2, y,train_size=0.8, random_state=4)
clf2 = LassoCV(cv=5,max_iter = 10000)
X2_scaled = preprocessing.scale(X2_train)
clf2.fit(X2_scaled, y_train)
y_pred = clf2.predict(X2_scaled)
rmse_scores.append(rmse(y_pred, y_train))
mae_scores.append(mean_absolute_error(y_train,y_pred))
r_squared.append(clf2.score(X2_scaled, y_train))
print(rmse(y_pred, y_train))
print(mean_absolute_error(y_train,y_pred))
models["Lasso on price"] = clf2
The model with "bedrooms_pred" has lower RMSE and MAE, so we conclude that our imputation method improved the model. Next, let's see whether a linear model really fits our data.
import plotly.graph_objects as go
fig = go.Figure()
data_scatter = go.Scatter(x = y_pred, y = y_pred - y_train, mode = 'markers',
line=dict(color='red'), name = 'residuals vs. fitted values')
fig.add_trace(data_scatter)
fig.update_layout(xaxis_title=r"Fitted Values",
yaxis_title="Residuals")
The residuals are clearly not randomly scattered, so the linear regression doesn't fit the data well.
fig = go.Figure()
fig.add_trace(go.Scatter(x = y_train, y = y_pred, mode = 'markers', line=dict(color='red')))
fig.update_layout(xaxis_title = 'actual price', yaxis_title = 'predicted price')
Our model even predicts negative prices. We need to fix this. Let's first try log transformation on our y to prevent negative fitted values.
# Model after log transformation
clf3 = LassoCV(cv=5,max_iter = 10000)
y3_train = np.log1p(y_train)
X3_scaled = X2_scaled
clf3.fit(X3_scaled, y3_train)
y3_pred = clf3.predict(X3_scaled)
rmse_scores.append(rmse(np.expm1(y3_pred), np.expm1(y3_train)))
mae_scores.append(mean_absolute_error(np.expm1(y3_train),np.expm1(y3_pred)))
r_squared.append(clf3.score(X3_scaled, y3_train))
print(rmse(np.expm1(y3_pred), np.expm1(y3_train)))
print(mean_absolute_error(np.expm1(y3_train),np.expm1(y3_pred)))
models["Lasso on log_p"] = clf3
fig = go.Figure()
fig.add_trace(go.Scatter(x = np.expm1(y3_train), y = np.expm1(y3_pred), mode = 'markers', line=dict(color='red')))
fig.update_layout(xaxis_title = 'actual price', yaxis_title = 'predicted price')
fig = go.Figure()
bar = go.Bar(x = list(models.keys())[1:], y = mae_scores[1:], name="CV MAE")
fig.add_trace(bar)
fig.update_layout(xaxis_title=r"model",
yaxis_title="MAE")
fig = go.Figure()
bar = go.Bar(x = list(models.keys())[1:], y = r_squared[1:], name="$R^2$")
fig.add_trace(bar)
fig.update_layout(xaxis_title=r"model",
yaxis_title="$R^2$")
The above bar plot shows that the MAE decreased. After the log transformation, the $R^2$ improves drastically.
fig = go.Figure()
data_scatter = go.Scatter(x = np.expm1(y3_pred), y = np.expm1(y3_pred) - np.expm1(y3_train), mode = 'markers',
line=dict(color='red'), name = 'residuals vs. fitted values')
fig.add_trace(data_scatter)
fig.update_layout(xaxis_title=r"Fitted Values",
yaxis_title="Residuals")
The residual plot still has some patterns in it, so linear regression model might still not be suitable for price prediction even though we have log transformed price.
fig = go.Figure()
fig.add_trace(go.Scatter(x = np.expm1(y3_train), y = np.expm1(y3_pred), mode = 'markers', line=dict(color='red')))
fig.update_layout(xaxis_title = 'actual price', yaxis_title = 'predicted price')
By taking the natural log of y, we have successfully gotten rid of the negative fitted values. To sum up, the regression model on $\log(y)$ reduces the MAE and eliminates the negative predicted prices, so the log transformation improved our model.
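The `log1p`/`expm1` pair used here is an exact round trip and, unlike a plain `log`, is defined at price 0; a quick sketch on sample values:

```python
import numpy as np

p = np.array([0.0, 10.0, 150.0, 1499.0])  # sample prices, including 0
logp = np.log1p(p)     # log(1 + p): defined even at p = 0
back = np.expm1(logp)  # exact inverse of log1p
print(np.allclose(back, p))  # → True
```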
We first try to improve our model by adding more features. We haven't examined "amenities" yet; we can extract keywords from it to create new categorical features.
def parse(am_st):
    arr = am_st.split(',')
    am = [s.translate(s.maketrans('', '', '"')).strip() for s in arr if s != '']
    return am
text_data['amenities']=text_data.apply(lambda x:parse(x.amenities),axis=1)
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
amenities=pd.DataFrame(mlb.fit_transform(text_data['amenities']),columns=mlb.classes_, index=df.index)
amenities.columns = [s.replace("[","") for s in amenities.columns]
amenities.columns = [s.replace("]","") for s in amenities.columns]
amenities = amenities.groupby(lambda x:x, axis=1).sum()
amenities = amenities.drop([""],axis = 1)
amenities.shape
df=pd.DataFrame(pd.concat([df,amenities],axis=1))
We use LASSO regression with 5-fold cross validation to help us select powerful amenity features.
clf4 = LassoCV(cv=5, max_iter = 10000)
y = df.log_p
keep_cols = [x for x in df.columns.values if x not in ["bedrooms_pred","id","price","log_p","longitude","latitude"]]
X = df[keep_cols]
X_train, X_test, y_train, y_test = train_test_split(X, y,train_size=0.8, random_state=4)
X_scaled = preprocessing.scale(X_train)
clf4.fit(X_scaled, y_train)
y_pred = clf4.predict(X_scaled)
rmse_scores.append(rmse(np.expm1(y_pred), np.expm1(y_train)))
mae_scores.append(mean_absolute_error(np.expm1(y_train),np.expm1(y_pred)))
r_squared.append(clf4.score(X_scaled, y_train))
print(rmse(np.expm1(y_pred), np.expm1(y_train)))
print(mean_absolute_error(np.expm1(y_train),np.expm1(y_pred)))
models["Lasso with amenity"] = clf4
Both the RMSE and MAE decreased after we included more features in our model.
fig = go.Figure()
bar = go.Bar(x = list(models.keys())[1:], y = mae_scores[1:], name="CV MAE")
fig.add_trace(bar)
fig.update_layout(xaxis_title=r"model",
yaxis_title="MAE")
kept = ~np.isclose(clf4.coef_, 0)
features = X.columns[kept]
print(len(features))
clf4.alpha_
len(X.columns)
LASSO selects 216 out of 357 features. The optimal hyperparameter produced by cross validation is 0.00075.
from sklearn.ensemble import RandomForestRegressor
Since our data is quite large, training the random forest without restricting the parameters takes more than 10 minutes. Therefore, we tune the parameters up front to speed up training. Six parameters need to be tuned: min_samples_split, max_leaf_nodes, min_samples_leaf, max_depth, n_estimators, and max_features.
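As an aside, instead of tuning one parameter at a time as below, a randomized search can explore several parameters jointly; a sketch on synthetic data (the grid values here are illustrative, not the ones used in this notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(123)
X = rng.normal(size=(300, 8))
y = X[:, 0] * 2 + X[:, 1] + rng.normal(scale=0.5, size=300)

# Sample 5 random combinations instead of trying the full grid.
param_dist = {
    "n_estimators": [10, 50, 100],
    "max_depth": [3, 5, 7, None],
    "max_features": ["sqrt", 0.5, 1.0],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=123),
    param_dist, n_iter=5, cv=3,
    scoring="neg_root_mean_squared_error", random_state=123,
)
search.fit(X, y)
print(sorted(search.best_params_))
```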
y = df.price
keep_cols = [x for x in df.columns.values if x not in ["bedrooms_pred","id","price","log_p","longitude","latitude"]]
X = df[keep_cols]
X_train, X_test, y_train, y_test = train_test_split(X, y,train_size=0.8, random_state=4)
train_results = []
min_samples_split = [10, 50, 100, 150,200, 250]
for sam in min_samples_split:
    rf = RandomForestRegressor(n_estimators = 10, max_depth = 3, min_samples_split = sam, random_state=123)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_train)
    train_results.append(rmse(y_pred, y_train))
train_results
min_samples_split doesn't affect the RMSE much, so we don't tune this parameter.
train_results = []
max_leaf_nodes = [10, 50, 100, 200]
for leaf in max_leaf_nodes:
    rf = RandomForestRegressor(n_estimators = 10, max_depth = 3, max_leaf_nodes = leaf, random_state=123)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_train)
    train_results.append(rmse(y_pred, y_train))
train_results
This parameter doesn't seem to matter either, so we don't tune it.
train_results = []
min_samples_leaf = [1,5,10,50,100]
for leaf in min_samples_leaf:
    rf = RandomForestRegressor(n_estimators = 10, max_depth = 3, min_samples_leaf = leaf, random_state=123)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_train)
    train_results.append(rmse(y_pred, y_train))
train_results
The RMSE barely changes as we vary this parameter, so we leave it at its default.
def rmse_score(model, X, y):
    return np.sqrt(np.mean((y - model.predict(X))**2))
from sklearn.model_selection import cross_val_score
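Note that scikit-learn's convention is that scorers are maximized. Since we only eyeball the raw RMSE values below, a plain callable works, but for use inside an automated search the metric should be wrapped with `make_scorer(..., greater_is_better=False)`; a sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def rmse_metric(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# greater_is_better=False makes sklearn negate the metric, so searches
# that maximize the score end up minimizing the RMSE.
neg_rmse = make_scorer(rmse_metric, greater_is_better=False)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=100)
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring=neg_rmse)
print((scores <= 0).all())  # negated RMSE is never positive
```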
train_results = []
# Tuning max_depth
n_depth = range(1, 11)  # max_depth must be an integer
for depth in n_depth:
    rf = RandomForestRegressor(n_estimators = 10, max_depth = depth, random_state=123)
    score = cross_val_score(rf, X_train, y_train, cv=10, scoring=rmse_score)
    train_results.append(np.mean(score))
train_results
Once max_depth reaches 7, the decrease in RMSE slows down, and cross validation beyond depth 7 takes far too long to run, so we choose 7 as the optimal value.
train_results = []
# Tuning n_estimators
n_estimators = [20, 30, 40, 50, 80]
for est in n_estimators:
    rf = RandomForestRegressor(n_estimators = est, max_depth = 2, random_state=123)
    score = cross_val_score(rf, X_train, y_train, cv=10, scoring=rmse_score)
    train_results.append(np.mean(score))
train_results
We can stop at 50 trees as increasing the number of trees doesn't significantly decrease the RMSE.
train_results = []
max_features = [1,5,10,50,100,150,200]
for f in max_features:
    rf = RandomForestRegressor(n_estimators = 10, max_depth = 3, max_features = f, random_state=123)
    score = cross_val_score(rf, X_train, y_train, cv=10, scoring=rmse_score)
    train_results.append(np.mean(score))
train_results
We choose max_features = 150. Now that all the parameters are tuned, we compare the random forest model with LASSO.
rf = RandomForestRegressor(n_estimators = 50, max_depth = 7,max_features = 150, random_state=123)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_train)
rmse_scores.append(rmse(y_pred, y_train))
mae_scores.append(mean_absolute_error(y_train,y_pred))
r_squared.append(rf.score(X_train, y_train))
print(rmse(y_pred, y_train))
print(mean_absolute_error(y_train,y_pred))
models["Random forest with amenity"] = rf
fig = go.Figure()
bar = go.Bar(x = list(models.keys())[1:], y = mae_scores[1:], name="CV MAE")
fig.add_trace(bar)
fig.update_layout(xaxis_title=r"model",
yaxis_title="MAE")
fig = go.Figure()
bar = go.Bar(x = list(models.keys())[1:], y = rmse_scores[1:], name="CV RMSE")
fig.add_trace(bar)
fig.update_layout(xaxis_title=r"model",
yaxis_title="RMSE")
The random forest model has the lowest RMSE and MAE. Overall the random forest model is the best so far.
import xgboost as xgb
data_dmatrix = xgb.DMatrix(data=X,label=y)
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', random_state=123)  # 'reg:linear' is a deprecated alias
xg_reg.fit(X_train,y_train)
pred = xg_reg.predict(X_train)
print(rmse(pred,y_train))
print(mean_absolute_error(pred,y_train))
Without any tuning, the model gives us an RMSE of 57 and MAE of 33 on the training set, the best result so far, but overfitting is likely. Hence, we need to tune the parameters to avoid overfitting. We tune the core XGBoost parameters that have the most impact on accuracy: max_depth, n_estimators, learning_rate, min_child_weight, and colsample_bylevel.
train_results = []
# Tuning max_depth
max_depth = [3,4,5,6,7,8]
for depth in max_depth:
    xg = xgb.XGBRegressor(max_depth = depth, random_state=123)
    score = cross_val_score(xg, X_train, y_train, cv=5, scoring=rmse_score)
    train_results.append(np.mean(score))
train_results
When max_depth is 5, the RMSE value is the lowest.
train_results = []
# Tuning n_estimators
n_estimators = [10, 20, 30, 40, 50, 80, 100]
for est in n_estimators:
    xg = xgb.XGBRegressor(n_estimators = est, random_state=123)
    score = cross_val_score(xg, X_train, y_train, cv=5, scoring=rmse_score)
    train_results.append(np.mean(score))
train_results
We choose 80 estimators because the RMSE barely decreases beyond that point.
train_results = []
# Tuning learning_rate
learning_rate = [0.01,0.05, 0.1,0.3]
for rate in learning_rate:
    xg = xgb.XGBRegressor(learning_rate = rate, random_state=123)
    score = cross_val_score(xg, X_train, y_train, cv=5, scoring=rmse_score)
    train_results.append(np.mean(score))
train_results
0.1 is the best choice, producing the lowest RMSE across the validation folds.
train_results = []
# Tuning min_child_weight
min_child_weight = [3,4,5,6,7]
for child in min_child_weight:
    xg = xgb.XGBRegressor(min_child_weight = child, random_state=123)
    score = cross_val_score(xg, X_train, y_train, cv=5, scoring=rmse_score)
    train_results.append(np.mean(score))
train_results
min_child_weight = 6 gives the lowest RMSE.
train_results = []
# Tuning colsample_bylevel
colsample_bylevel = [0.3,0.4,0.5,0.6,0.7]
for level in colsample_bylevel:
    xg = xgb.XGBRegressor(colsample_bylevel = level, random_state=123)
    score = cross_val_score(xg, X_train, y_train, cv=5, scoring=rmse_score)
    train_results.append(np.mean(score))
train_results
colsample_bylevel = 0.6 gives us the lowest RMSE. Now every parameter has been tuned, so let's build our XGBoost model.
xg = xgb.XGBRegressor(colsample_bylevel=0.6,
learning_rate=0.1, max_depth= 5,
min_child_weight=6,
n_estimators=80)
xg.fit(X_train,y_train)
pred = xg.predict(X_train)
print(rmse(pred,y_train))
print(mean_absolute_error(pred,y_train))
models["XGBoost"] = xg
rmse_scores.append(rmse(pred, y_train))
mae_scores.append(mean_absolute_error(y_train,pred))
r_squared.append(xg.score(X_train, y_train))
fig = go.Figure()
bar = go.Bar(x = list(models.keys())[1:], y = mae_scores[1:], name="CV MAE")
fig.add_trace(bar)
fig.update_layout(xaxis_title=r"model",
yaxis_title="MAE")
fig = go.Figure()
bar = go.Bar(x = list(models.keys())[1:], y = rmse_scores[1:], name="RMSE")
fig.add_trace(bar)
fig.add_trace(go.Bar(x = list(models.keys())[1:], y = mae_scores[1:], name="MAE"))
fig.update_layout(xaxis_title=r"model",
yaxis_title="Error")
As shown above, XGBoost outperforms the other models with its low RMSE and MAE. Therefore, we choose XGBoost for price prediction. It is finally time to look at its performance on the test set.
pred = xg.predict(X_test)
print(rmse(pred,y_test))
print(mean_absolute_error(pred,y_test))
fig = go.Figure()
bar = go.Bar(x = ["XGBoost on train","XGBoost on test"], y = [41.53482780097031,42.87623038816402], name="MAE")
fig.add_trace(bar)
fig.update_layout(xaxis_title=r"model",
yaxis_title="MAE")
fig = go.Figure()
bar = go.Bar(x = ["XGBoost on train","XGBoost on test"], y = [77.94973409115237,80.72161874970276], name="RMSE")
fig.add_trace(bar)
fig.add_trace(go.Bar(x = ["XGBoost on train","XGBoost on test"], y = [41.53482780097031,42.87623038816402], name="MAE"))
fig.update_layout(xaxis_title=r"model",
yaxis_title="Error")
The differences between the RMSE and MAE scores on the train and test sets are very small. Therefore, our model is appropriately tuned and shows no serious overfitting.
features = X_test.columns
importances = xg.feature_importances_
indices = np.argsort(importances)
# customized number
num_features = 20
plt.figure(figsize=(10,10))
plt.title('Feature Importances')
# only plot the customized number of features
plt.barh(range(num_features), importances[indices[-num_features:]], color='b', align='center')
plt.yticks(range(num_features), [features[i] for i in indices[-num_features:]])
plt.xlabel('Relative Importance')
plt.show()
The top 20 most important features are displayed above. Whether the room type is a private room is the most important feature for rental price prediction.